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Abstract 

We present doubly stochastic gradient MCMC, 
a simple and generic method for (approximate) 
Bayesian inference of deep generative models 
(DGMs) in a collapsed continuous parameter 
space. At each MCMC sampling step, the al¬ 
gorithm randomly draws a mini-batch of data 
samples to estimate the gradient of log-posterior 
and further estimates the intractable expectation 
over hidden variables via a neural adaptive im¬ 
portance sampler, where the proposal distribu¬ 
tion is parameterized by a deep neural network 
and learnt jointly. We demonstrate the effective¬ 
ness on learning various DGMs in a wide range 
of tasks, including density estimation, data gen¬ 
eration and missing data imputation. Our method 
outperforms many state-of-the-art competitors. 

1. Introduction 

Learning deep models that consist of multi-layered rep¬ 
resentations has obtained state-of-the-art performance in 
many tasks (Bengio et al., 2014; Hinton et al., 2006), partly 
due to their ability on capturing high-level abstractions. As 
an important family of deep models, deep generative mod¬ 
els (DGMs) (Hinton et al., 2006; Salakhutdinov & Hinton, 
2009) can answer a wide range of queries by performing 
probabilistic inference, such as inferring the missing val¬ 
ues of input data, which is beyond the scope of recognition 
networks such as deep neural networks. 

However, probabilistic inference with DGMs is challeng¬ 
ing, especially when a Bayesian formalism is adopted, 
which is desirable to protect the DGM from overfit¬ 
ting (MacKay, 1992; Neal, 1995) and to perform sparse 
Bayesian inference (Gan et al., 2015b) or nonparametric 
inference (Adams et al., 2010) to learn the network struc¬ 
ture. For Bayesian methods in general, the posterior infer¬ 
ence often involves intractable integrals because of several 
potential factors, such as that the space is extremely high¬ 


dimensional and that the Bayesian model is non-conjugate. 
To address the challenges, approximate methods have to 
be adopted, including variational (Jordan et al., 1999; Saul 
et al., 1996) and Markov chain Monte Carlo (MCMC) 
methods (Robert & Casella, 2005). 

Much progress has been made on stochastic variational 
methods for DGMs (Kingma & Welling, 2014; Rezende 
et al., 2014; Ranganath et al., 2014), under some mean-field 
or parameterization assumptions. One key feature of such 
variational methods is that they marry ideas from deep neu¬ 
ral networks to parameterize the variational distribution by 
a recognition network and jointly learn the parameters by 
optimizing a variational bound. In contrast, little work has 
been done on extending MCMC methods to learn DGMs in 
a Bayesian setting, which are often more accurate, except a 
few exceptions. Gan et al. (2015b) present a Gibbs sampler 
for deep sigmoid belief networks with a sparsity-inducing 
prior via data augmentation, Adams et al. (2010) present 
a Metropolis-Hastings method for cascading Indian buffet 
process and Li et al. (2016) develop a high-order stochastic 
gradient MCMC method and apply to deep Poisson factor 
analysis (Gan et al., 2015a). 

In this paper, we present a simple and generic method, 
named doubly stochastic gradient MCMC, to improve the 
efficiency of performing Bayesian inference on DGMs. By 
drawing samples in the collapsed parameter space, our 
method extends the recent work on stochastic gradient 
MCMC (Welling & Teh, 2011; Ahn et al., 2012; Chen et al., 
2014; Ding et al., 2014) to deal with the challenging task of 
posterior inference with DGMs. Besides the stochasticity 
of randomly drawing a mini-batch of samples in stochastic 
approximation, our algorithm introduces an extra dimen¬ 
sion of stochasticity to estimate the intractable gradients 
by randomly drawing the hidden variables in DGMs. The 
sampling can be done via a Gibbs sampler, which however 
has a low mixing rate in high dimensional spaces. To ad¬ 
dress that, we develop a neural adaptive importance sam¬ 
pler (NAIS), where the adaptive proposal is parameterized 
by a recognition network and the parameters are optimized 
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by descending inclusive KL-divergence. By combining the 
two types of stochasticity, we construct an asymptotically 
unbiased estimate of the gradient in the continuous param¬ 
eter space. Then, a stochastic gradient MCMC method is 
applied with guarantee to (approximately) converge to the 
target posterior when the learning rates are set under some 
proper annealing scheme. 

Our method can be widely applied to the DGMs with either 
discrete or continuous hidden variables. In experiments, we 
demonstrate the efficacy on learning various DGMs, such 
as deep sigmoid belief networks (Mnih & Gregor, 2014), 
for density estimation, data generation and missing value 
imputation. Our results show that we can outperform many 
strong competitors for learning DGMs. 

2. Related Work 

Recently, there has been a lot of interest in developing 
variational methods for DGMs. One common strategy 
for dealing with the intractable posterior distribution is to 
approximate it with a recognition (or inference) network, 
and a variational lower bound is then optimized (Kingma 
& Welling, 2014; Mnih & Gregor, 2014). Note in these 
methods the gradients are also estimated doubly stochas¬ 
tically. Kingma & Welling (2014) and Mnih & Gregor 
(2014) adopt variance reduction techniques to make these 
methods practically applicable. Titsias & Lazaro-Gredilla 
(2014) propose a so-called “doubly stochastic variational 
inference” method for non-conjugate Bayesian inference. 
We are inspired by these methods when naming ours. 

The reweighted wake-sleep (RWS) (Bornschein & 
Bengio, 2015) and importance weighted autoencoder 
(IWAE) (Burda et al., 2015) directly estimate the 
log-likelihood (as well as its gradient) via importance sam¬ 
pling, where the proposal distribution is characterized by a 
recognition model. These methods reduce the gap between 
the variational bound and the log-likelihood, which is 
shown much tighter than that in Kingma & Welling (2014). 
Such tighter bound results in an asymptotically unbiased 
estimator of its gradient. We draw inspiration from these 
variational methods to build our MCMC samplers. 

Our work is closely related to the recent progress on 
neural adaptive proposals for sequential Monte Carlo 
(NASMC) (Gu et al., 2015). Different from our work, 
NASMC deals with dynamical models such as Hidden 
Markov models and adopts recurrent neural network as the 
proposal. We use a similar KL-divergence as NASMC to 
learn the proposal. 

Finally, Gan et al. (2015a) adopt a Monte Carlo estimate via 
Gibbs sampling to the intractable gradients under a stochas¬ 
tic MCMC method particularly for topic models. Besides 
a general perspective which is applicable to various types 
of DGM models, we propose a neural adaptive importance 


sampler which is more efficient than Gibbs sampling and 
leads to better estimates. 

3. Doubly Stochastic Gradient MCMC for 
Deep Generative Models 

We now present the doubly stochastic gradient MCMC for 
deep generative models. 

3.1. Deep Generative Models 

Let X = be a given dataset with N i.i.d. sam¬ 

ples. A deep generative model (DGM) assumes that each 
sample G is generated from a vector of hidden vari¬ 
ables Zn G which itself follows some prior distribution 
p(z|a). Let p(x|z,/3) be the likelihood model. The joint 
probability of a DGM is as follows: 

N 

p{X,Z\d) = P[p(z„|a)p(x„|z„,/3), (1) 

n=l 

where 6 := {oL^jS). Depending on the structure of z, 
various DGMs have been developed, such as deep belief 
networks (Hinton et al., 2006), deep sigmoid belief net¬ 
works (Mnih & Gregor, 2014), and deep Boltzmann ma¬ 
chines (Salakhutdinov & Hinton, 2009). 

For most DGMs, the hidden variables z are often assumed 
to have a directed multi-layer representation z = 
where L is the number of hidden layers. Then the prior 
distribution has the factorization form: 

L-l 

p{z\6) =p{z^^^\e) p(z(')|z('+^\0), (2) 

1 = 1 

where 6) is defined by some conditional 

stochastic layer that takes as input and generates 

samples z^^\ The likelihood model is further assumed to 
be a conditional stochastic layer again: 

p{x\z,e) =p(x.\z^^\d), (3) 

where the samples are generated conditioned on the lowest 
hidden layer only. 

Various conditional stochastic layers have been developed. 
In the following we briefiy summarize the layers used in 
our experiments: 

Sigmoid Belief Network layer (SBN): A SBN layer (Saul 
et al., 1996) is a directed graphical model that defines the 
conditional probability of each independent binary variable 
^ given the upper layer as follows: 

p(^P = llz'^^^^^) = - 1 - bi), (4) 

where cr{x) = 1/(1 + e~^) is the sigmoid function, 
denotes the i-th row of the weight matrix and bi is the bias. 

Deep Autoregressive Network layer (DARN): A 

DARN (Gregor et al., 2014) layer assumes in-layer con¬ 
nections on the SBN layer. It defines the probability of 
each binary variable conditioned on both the upper 
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layer and the previous in the same layer: 

+ Wi,<iZ^^j + 6i), 

where z^^- refers to {zi \ • • • , and Wi,<i denotes 

the first i elements of the i-th row of the in-layer connection 
weight matrix. 

Conditional NADE layer: The NADE (Larochelle & 
Murray, 2011) models the distribution of high-dimensional 
discrete variables x autoregressively with an internal 
MLP (Bengio & Bengio, 2000). The dependency between 
the variables is captured by a single-hidden-layer feed¬ 
forward neural network: 

p{Xi = l|x<i) = (T(Vi,:(T(W:,<iX<i + a) + bi). 

Boulanger-Lewandowski et al. (2012) and Bomschein & 
Bengio (2015) amend this model to a conditional NADE 
layer: 

p{zP = = cr('Vi,:Cr(W:,<iZ^^- + 

X 

Uz(*+1) + a) + Ri,:z(*+1) ^ 

where we use W; to refer the sub-matrix consisting the 
first i columns of W. 

Variational Auto-Encoder layer (VAE): VAE (Kingma & 
Welling, 2014) differs from the above layers in that its out¬ 
put can be binary or real-valued variables. It contains an 
internal MLP /(z^^+^^) which encodes the parameters of 
the distribution |z^^+^^). The MLP may itself contain 
multiple deterministic layers. Eor binary output variable, 
the distribution of each individual variable is: 

p{zf^ = l|z('+i)) = (T(Wi, J(z('+1)) + bi). (6) 

Eor real-value output variable, the distribution of each inde¬ 
pendent variable is a normal distribution whose mean 
and variance are as follows: 

logcT2=W„i,/(z('+i)) + 6.i. 

Top layer and Likelihood model: The likelihood model in 
Eqn. (3) can be obtained by treating x as The distri¬ 
bution of top layer p{z^^^\6), which has no ancestral layer, 
can be obtained by simply treating the input as 0 vector or 
setting to fixed distribution, e.g., standard normal for the 
VAE layer (Kingma & Welling, 2014). Detailed descrip¬ 
tion of the layers and the model construction can be found 
in Supplementary Material. 

3.2. Variational MLE for DGMs 

Learning DGMs is often very challenging due to the in¬ 
tractability of posterior inference. One popular type of 
methods resort to stochastic variational methods under the 
maximum likelihood estimation (MLE) framework, 6 = 


argmax^ logp(X|0). These methods commonly utilize 
some variational distribution g'(z|x; 0) to approximate the 
true posterior p(z|x, 6). Eor DGMs, the variational distri¬ 
bution g(z|x; (j)) can be formalized as a recognition model 
(or inference network) (Kingma & Welling, 2014; Mnih & 
Gregor, 2014; Bomschein & Bengio, 2015), which takes x 
as inputs and outputs z stochastically. Specifically, for the 
DGMs with multi-layer representation z = de¬ 

scribed in Sec. 3.1, the variational distribution can be for¬ 
mulated as: 

L-l 

q{z\x,<f)) = q{z^^^\x,<f)) P[ q{z^^+^^\z^^\ <f)), (8) 

where each \z ^^^, 0) and q{z^^^ |x, 0) are again de¬ 

fined by some stochastic layers parametrized by 0. With 
the variational distribution, a variational bound of the log- 
likelihood logp(X|0) can be derived and optimized, e.g., 
the variational lower bound in (Kingma & Welling, 2014) 
and a tighter bound in (Burda et al., 2015). 

However, the variational bound is often intractable to com¬ 
pute analytically for DGMs. To address this challenge, re¬ 
cent progress (Kingma & Welling, 2014; Rezende et al., 
2014; Mnih & Gregor, 2014) has adopted hybrid Monte 
Carlo and variational methods, which approximate the in¬ 
tractable expectations and their gradients over the parame¬ 
ters (0, 0) via some unbiased Monte Carlo estimates. Eur- 
thermore, to handle large-scale datasets, stochastic opti¬ 
mization (Robbins & Monro, 1951; Bottou, 1998) of the 
variational objective can be used with a suitable learning 
rate annealing scheme. Note variance reduction is a key 
part of these methods to have fast and stable convergence. 

3.3. Doubly Stochastic Gradient MCMC 

We consider the Bayesian setting to infer the posterior 
distribution p(0, ZIX) ex po{0)p{Z\0)p{'X\Z,0) or its 
marginal distribution p{6\X.), by assuming some prior 
Po{0). A Bayesian formalism of deep learning enjoys sev¬ 
eral advantages, such as preventing the model from over¬ 
fitting and performing sparse/nonparametric Bayesian in¬ 
ference, as mentioned before. However, except a handful 
of special examples, the posterior distribution is intractable 
to infer. Though variational methods can be developed as 
in (Kingma & Welling, 2014; Rezende et al., 2014; Mnih 
& Gregor, 2014; Bomschein & Bengio, 2015), under some 
mean-field or parameterization assumptions, they often re¬ 
quire non-trivial model-specific deviations and may lead 
to inaccurate approximation when the assumptions are not 
properly made. Here, we consider MCMC methods, which 
are more generally applicable and can asymptotically ap¬ 
proach the target posterior. 

A straightforward application of MCMC methods can be 
Gibbs sampling or stochastic gradient MCMC (Welling & 
Teh, 2011; Ahn et al., 2012; Chen et al., 2014; Ding et al., 
2014). However, a Gibbs sampler can suffer from the 
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random-walk behavior in high-dimensional spaces. Fur¬ 
thermore, a Gibbs sampler would need to process all data 
at each iteration, which is prohibitive when dealing with 
large-scale datasets. The stochastic gradient MCMC meth¬ 
ods can lead to significant speedup by exploring statistical 
redundancy in large datasets; but they require that the sam¬ 
ple space is continuous, which is not true for many DGMs, 
such as deep sigmoid belief networks that have discrete 
hidden variables. Below, we present a doubly stochastic 
gradient MCMC with general applicability. 

3.3.1. General Procedure 

We make the mildest assumption that the parameter space 
is continuous and the log joint distribution logp(x, z|0) 
is differentiable with respect to the model parameters 
6 almost everywhere except a zero-mass set. Such an 
assumption is true for almost all existing DGMs. Then, our 
method draws samples in a collapsed space that involves 
the model parameters 9 only, by integrating out the hidden 
variables z: 

1 ^ r 

^ n J Pi^n,Zn\d) dz„, (9) 

where for discrete variables the integral will be a 
summation. Then the gradient of the log-posterior is 
V0logp(0|X) = VelogpoW + Ell V0logp(x„|0), 
where the second term can be calculated as: 


V0iogp{x.\e) = 


1 


d 


-I 


p(x|0) d9 
p{x,z\d) d 


p(x|0) do 


jp{x,z\9) dz 

logp(x, z\6) dz 


= E, 


'p(z|x,0) 


d 

— logp(x,z|6>) 


( 10 ) 


9t+i — 9t ^Pt, (12) 

Pt+i = Pt - ©Pt - AV0(7(0t+i) + vEi/V'(0, AI), 

^t+i = + A(pf+i © Pf+i — I), 

where 0 represent element-wise product, p are the aug¬ 
mented momentum variables, ^ are the diffusion factors, A 
is the step size and ^4 is a constant that controls the noise in¬ 
jected. With a proper annealing scheme over the step size 
A, the Hamiltonian dynamics will converge to the target 
posterior. 

3.3.2. Neural Adaptive Importance Sampler 
The remaining challenge is to compute the gradient as the 
expectation in Eqn. (10) is often intractable for DGMs. 
Here, we construct an unbiased estimate of the gradient by 
a set of samples {z^^^ }s=i from the posterior p(z|x, 6): 

V0logp(x|0) ~ ■ (13) 

To draw the samples a straightforward strategy is 
Gibbs sampling. Gibbs samplers are simple and applica¬ 
ble to both discrete and continuous hidden variables. How¬ 
ever, it may be hard to develop Gibbs samplers for most 
DGMs, as the highly complicated models often result in 
non-conjugacy. More importantly, a Gibbs sampler can be 
slow to mix in high-dimensional spaces. Below, we present 
a neural adaptive importance sampler (NAIS), which again 
applies to both discrete and continuous hidden variables but 
with faster mixing rates. 

Let g'(z|x;(/)) be a proposal distribution which satisfies 
g'(z|x; 0) > 0 wherever p(z IX, 6) > 0, we then have 


With the above gradient, we can adopt a stochastic gradient 
MCMC (SG-MCMC) method to draw samples of 9. We 
consider the stochastic gradient Nose-Hoover thermostat 
(SGNHT) (Ding et al., 2014). Note our method can be nat¬ 
urally extended to other SG-MCMC methods, e.g., stochas¬ 
tic gradient Langevin dynamics (Welling & Teh, 2011), 
stochastic gradient Hamiltonian Monte Carlo (Chen et al., 
2014) and high-order stochastic gradient thermostats (Li 
et al., 2016). SGNHT defines a potential energy U{9) = 
— logp(0|X) where p{9\X.) is the target posterior distri¬ 
bution, and use a random mini-batch B of the data X to 
approximate the true gradient of the potential energy: 

VeUiO) = -Ve logp^iO)-—^ e logp(x„|0). (11) 

I I neB 

We follow Gan et al. (2015a) to use the multivariate 
version of SGNHT that generate samples by simulating 
the dynamics as follows: 


V0logp(x|0) = E,(2 ;|x;0) 


p(z|x,^) d 
9(z|x; (j)) de 


logp(x,z|0) 


from which an unbiased importance sampling estimator 
can be derived with the sample weights being • 

However, computing p(z|x, 0) is often hard for most 
DGMs. By noticing that p(z|x, 9) (x p(x, z\9) and com¬ 
puting p(x, z|0) is easy, we derive a self-normalized im¬ 
portance sampling estimate as follows: 


logp(x|0) 




E b 

8 = 1^ 


(0 


, (14) 


where is a set of samples drawn from the pro¬ 
posal g(z|x; (j)) and is the unnormalized 

likelihood ratio. This estimate is asymptotically consistent 
(Owen, 2013), and its slight bias decreases as drawing more 
samples. 
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Neural Adaptive Proposals: To reduce the variance of 
the estimator in Eqn. (14) and get accurate gradient esti¬ 
mates, g'(z|x; 0 ) should be as close to j 9 (z|x, 0 ) as pos¬ 
sible. Here, we draw inspirations from variational meth¬ 
ods and learn adaptive proposals (Gu et ah, 2015) by min¬ 
imizing some criterion. Specifically, we build a recogni¬ 
tion model (or inference network) to represent the proposal 
distribution g'(z|x; 0 ) of hidden variables, as in the varia¬ 
tional methods (Kingma & Welling, 2014; Bornschein & 
Bengio, 2015). Such a recognition model takes x as input 
and outputs as samples from g(z|x; 0 ), as described 

in Sec. 3.2. We optimize the quality of the proposal dis¬ 
tribution by minimizing the inclusive KL-divergence be¬ 
tween the target posterior distribution and the proposal 
Ep(z|x,e) [log ^(z|x;^) ] (Bornschein & Bengio, 2015; Gu 
et al., 2015) or equivalently maximizing the expected log- 
likelihood of the recognition model 

= Ep( 2 |x, 0 ) [log g(z|x; </))]. (15) 

We choose this objective due to the following reasons. If 
the target posterior belongs to the family of proposal distri¬ 
butions, maximizing J{(j)]9^yi) leads to the optimal so¬ 
lution that is the target posterior; otherwise, minimizing 
the inclusive KL-divergence tends to find proposal distri¬ 
butions that have higher entropy than the target posterior. 
Such a property is advantageous for importance sampling 
as we require that g(z|x; 0) > 0 wherever p(z|x, 9) > 0. 
In contrast, the exclusive KL-divergence £((/);0,x) := 
Eg(z|x; 0 ) [log as widely adopted in the variational 

methods (Kingma & Welling, 2014; Rezende et al., 2014; 
Mnih & Gregor, 2014), does not have such a property — It 
can happen that g(z|x; 0) = 0 when p(z|x, 9) > 0; there¬ 
fore unsuitable for importance sampling. 

The gradient of J"(0; x) with respect to the parameters 
of the proposal distribution is 

V 0 j( 0 ; 0 ,x) = Ep( 2 .|x^ 0 )[V 0 log^(z|x; 0 )], (16) 

which can be estimated using importance sampling similar 
as in Eqn. (14): 



where are samples from the latest proposal 

distribution g(z|x; 0 ) and the weights are the same as in 
Eqn. (14). To improve the efficiency, we adopt stochas¬ 
tic gradient descent methods to optimize the objective 
J{(f)-,6,X) := with the gradient 

being estimated by a random mini-batch of data points B 
at each iteration: 

J {(f); X) ~ ( 1 ^) 

I I neB 


Algorithm 1 Doubly Stochastic Gradient MCMC with 
Neural Adaptive Proposals 
Input: data X 
Initialize 9, (j) 
for epoch = 1 , 2 , • • • do 

for mini-batch Bi C {1, • * * , do 
Sample ^ g'(z|xn; 0), n e Bi 

Estimate V logp(x^|0) with Eqn. (14), n e Bi 
Update 9 with Eqn. (29) 

Sample {zn ^ q{z\xn; (f)),n G Bi (optionally) 
Update (j) with the gradient in Eqn. (18) 

end for 
end for 

Output: samples of 9 


where each term \/(f)J'{(f);9,:sfn) is further estimated by 
samples as in Eqn. (17). 

With the above gradient estimates, we get the overall al¬ 
gorithm with neural adaptive importance sampling, as out¬ 
lined in Alg. 1, where we adaptively update the proposal 
distribution by performing one step of recognition model 
update after each step of SGNGT simulation. Practically, 
re-sampling the hidden variables before each updating is 
helpful to get more accurate estimations. The more detailed 
version of Alg. 1 is included in Supplementary Material. 

4. Experiments 

We now present a series of experimental results of our 
doubly stochastic MCMC method on several representa¬ 
tive deep generative models. We use the doubly stochas¬ 
tic gradient Nose-Hoover thermostat with a neural adap¬ 
tive importance sampler (DSGNHT-NAIS) in the experi¬ 
ments. In Sec. 4.2, various DGMs with discrete hidden 
variables, such as sigmoid belief networks (Neal, 1992), are 
trained on the binarized MNIST (Salakhutdinov & Mur¬ 
ray, 2008) and the Caltech 101 Silhouettes (Marlin et al., 
2010) datasets. We compare the predictive performance 
with state-of-the-art methods in terms of the estimated log- 
likelihood (Est. LL.) on the test set. We also demonstrate 
the generative performance and analyze the sensitivity to 
main hyperparameters. In Sec. 4.3, we train variational 
auto-encoders (Kingma & Welling, 2014) on the binarized 
MNIST and the Omniglot (Lake et al., 2013) datasets. 

4.1. Setup 

In our experiments, all models (including recognition mod¬ 
els) are initialized following the heuristic of Glorot & Ben¬ 
gio (2010). We set the Student-t prior to all the model 
parameters. We use the reformulated form of multivariate 
SGNHT as described in Supplementary Material. The per- 
batch learning rate 7 is set among {0.01, 0.005, 0.001}, 
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Table 1. MNIST results of various methods on five benchmark architectures. “Dim” denotes the number of hidden variables in each layer, 
with layer closest to the data laying left. Values within brackets are variational lower bounds, values without brackets are estimated log- 
likelihoods. (:*:) Use NADE layers for recognition model. The results of NVIL are from Mnih & Gregor (2014); the results of Wake-sleep 
and RWS are from Bomschein & Bengio (2015); and the results of Data Augmentation (DA) are from Gan et al. (2015b). 


Model 

Dim 

NVIL 

Wake-Sleep 

RWS 

DA 

DSGNHT-Gibbs 

DSGNHT-NAIS 

SBN 

200 

(-113.1) 

-116.3(-120.7) 

-103.1 

(-113.02) 

-102.9 

-101.8 

SBN 

200-200 

(-99.8) 

-106.9(-109.4) 

-93.4 

(-110.74) 

-100.6 

-92.5 

SBN 

200-200-200 

(-96.7) 

-101.3(-104.4) 

-90.1 

- 

-97.5 

-89.9 

DARN 

200 

- 

- 

-89.2* 

(-102.11) 

-101.1 

-89.3* 

NADE 

200 

- 

- 

-86.8* 

- 

- 

-83.7* 


SBN 200 


d -101.7 

« -101.9 

- 102.1 

cn 

^ -102.3 


10 ° 10 ^ 10 ^ 
DARN 200 



SBN 200-200-200 




Figure 1 . Log-likelihood estimation on MNIST for different mod¬ 
els w.r.t number of posterior samples M used for posterior mean 
estimator. Dotted line marks the results reported in Table 1. 


from which we report the experiment with best perfor¬ 
mance. If not noted otherwise, the number of samples used 
during training is set to S' = 5. The mini-batch size \B\ 
is set to 100 for all experiments. The parameters of recog¬ 
nition model are updated using the Adam (Kingma & Ba, 
2015) optimizer with step sizes of {1, 3, 5} x 10“^. 

As our method infers the posterior p{0\X.), we adopt the 
posterior mean estimator for model evaluation: 

1 ^ 

m=l 

To compute the posterior mean, we start to collect posterior 
samples when we observe the Est. LL. on the validation 
set does not increase for 10 consecutive epochs. Then M 
samples }m=i ^ more epochs are averaged for 
final evaluation. If not mentioned otherwise, the number 
of samples used for computing the posterior mean is set to 
M = 100. We will also show how M influences the results. 


To evaluate the inferred model 0 in terms of Est. LL., we 
adopt the AT-sample importance weighting estimation Ck’ 


~q(z|x;0) 



p(x, 

|x) 


( 20 ) 


Such estimation is also used by Bomschein & Bengio 
(2015) and Burda et al. (2015). We will clarify how we 
set the number of samples K used for estimating the log- 
likelihoods and investigate how it influences the quality of 
the estimator. 




Figure 2. Log-likelihood estimation on MNIST w.r.t (Left) num¬ 
ber of samples S used during training and (Right) number of sam¬ 
ples K used during estimating the log-likelihood. Dotted line 
marks the results estimated with S' = 5 and K = 10^ for SBN 
and K = 500 for DARN and NADE , as reported in Table 1. 


DARN and NADE are not permutation-invariant models. 
In our experiments, the ordering is simply determined by 
the original order in the dataset. 

See Supplementary Material for more details about the ex¬ 
perimental setting. 

4.2. Discrete Hidden Variable Models 

4.2.1. Binarized MNIST 

The binarized MNIST dataset consists of 50, 000 training 
samples, 10,000 validation samples and 10,000 test sam¬ 
ples. 

We consider flve benchmark models: three SBN mod¬ 
els, one DARN model and one NADE model. Eor the 
three models with SBN layers, we also use SBN layers 
to construct the recognition model; for the two models 
with DARN layer and NADE layer, we follow Bomschein 
& Bengio (2015) to use NADE layer for the recognition 
model. The model sizes and the results are summarized 
in Table 1. Details of the construction of the models are 
summarized in Supplementary Material. 

We first investigate the effect of the neural adaptive impor¬ 
tance sampler (NAIS). Eor comparison, we also estimate 
the gradient Eqn. (10) by directly sampling from p(z|x, 0) 
using a Gibbs sampler. (The derivation for the Gibbs sam¬ 
pler is included in Supplementary Material.) We denote the 
resulting method by DSGNHT-Gibbs. In Table I we ob¬ 
serve that the DSGNHT using a NAIS consistently outper- 
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Figure 3. Visualization: (a) Training data, (b) Samples from the learned models: NADE/NADE 200 for MNIST and NADE/NADE 150 
for Caltech 101 Silhouettes, (c) Eeatures at the bottom layer learned with sparse prior, (d) Eeatures at the bottom layer learned with 
normal prior. (Top) MNIST. (Bottom) Caltech 101 Silhouettes. 


Table 2. MNIST results of various methods and models. Results 
are taken from [1] Bomschein & Bengio (2015), [2] Larochelle & 
Murray (2011), [3] Uria et al. (2014), [4] Gregor et al. (2014), [5] 
Salakhutdinov & Murray (2008), [6] Raiko et al. (2014), [7] Mur¬ 
ray & Salakhutdinov (2009), [8] Burda et al. (2015). o Trained on 
the original MNIST dataset. 


Methods (Models) 

Est. test LL. 

DSGNHT-NAIS (NADE/NADE 200) 

-83.67 

RWS (NADE/NADE 250) [1] 

-85.23 

RWS (ARSBN/SBN 500) [1] 

-84.18 

NADE (500 hidden units) [2] 

-88.86 

EoNADE 2hl (128 orderings) [3] 

-85.10 

DARN (500 hidden units) [4] 

-84.13 

RBM (500 hidden units) [5] 

-86.34 

EoNADE-5 2HL(128 Ords) [6] 

-84.68 

DBN 2hl [7] ^ 

-84.55 

IWAE 2sl [8] 

-85.32 


forms the DSGNHT using a Gibbs sampler, especially for 
deeper models and autoregressive models, since the model 
parameters are higher-dimensional and highly correlated. 
We then compare our method to several other state-of-the- 
art methods on the five benchmark models. We observe that 
our method outperforms RWS almost on all models, except 
for DARN 200, on which we are slightly worse than RWS. 

We compare our best result to the state-of-the-art results 
on binarized MNIST in Table 2. The NADE/NADE 200 
model achieves an Est. LL. of —83.67, which outperforms 
most published results. Gregor et al. (2015) give a lower 
bound —80.97, which exploits spatial structure. IWAE 



Figure 4. MNIST missing data prediction by SBN 200-200-200: 
(Top) Original data. (Middle) Hollowed data. (Bottom) Recon¬ 
structed data. 

(Burda et al., 2015) achieves —82.90, which is trained on 
the original MNIST dataset (Lecun et al., 1998) and thus 
not directly comparable. We cite their results on the bina¬ 
rized MNIST in Table 2. 

In Eig. 1, we investigate the infiuence of the number of sam¬ 
ples M on the posterior mean estimator Eqn. (19). We can 
observe that on all models using more samples for poste¬ 
rior mean brings consistent improvements. On SBN/SBN 
models using M = 100 samples improves around 0.6 nat 
than using M = 1 (in which case one posterior sample 
is estimated). On autoregressive models, using M = 100 
samples brings an improvement more than 2 nats. 

We show the infiuence of the number of samples S used 
during training in Eig. 2(a). we observe that the Est. LL. 
on test data improves as S grows up. Eig. 2(b) presents 
the curves of the final estimated test log-likelihood with re- 
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Table 3. Caltech 101 Silhouettes results of various training meth¬ 
ods and models. Results are taken from [1] Bornschein & Bengio 
(2015), [2] Cho et al. (2013), [3] Raiko et al. (2014). f Results are 
produced using the authors’ published code. 


Methods (Models) 

Est. test LL. 

DSGNHT-NAIS (SBN/SBN 200) 

-122.7 

DSGNHT-NAIS (SBN/SBN 200-200) 

-108.0 

RWS (SBN/SBN 200)t 

-134.4 

RWS (SBN/SBN 200-200)t 

-126.0 

DSGNHT-NAIS (SBN/SBN 200-200-200) 

-105.2 

DSGNHT-NAIS (SBN/SBN 300-100-50-10) 

- 103.6 

DSGNHT-NAIS (NADE/NADE 150) 

- 100.0 

RWS (SBN/SBN 300-100-50-10) [1] 

-113.3 

RWS (NADE/NADE 150) [1] 

-104.3 

NADE (500 hidden units) [1] 

-110.6 

RBM (4000 hidden units) [2] 

-107.8 

NADE-5 (4000 hidden units) [3] 

-107.3 


spect to the number of samples K used for the estimator 
Eqn. (20). We observe that K = 100,000 and K = 500 
are large enough to get a good Est. LL. for SBN and 
DARN(NADE) models, respectively. 

Eig. 3 visualizes the generative performance of the learned 
models. In Eig. 3(a), we show the randomly sampled train¬ 
ing data of MNIST and Caltech 101 Silhouettes. Eig. 3(b) 
displays the examples generated from the learned models. 
We observe that the generated samples are visually good. 

One advantage of Bayesian framework is that we can spec¬ 
ify sparsity-encouraging priors on the model parameters 
explicitly, e.g., the Student-t prior in our experiments. 
Eig. 3(c) and Eig. 3(d) demonstrate the difference between 
features learned with a sparse (Student-t prior) prior and a 
non-sparse (Gaussian) prior. We observe that the features 
learned with a sparse prior appear more localized. 

We further demonstrate the ability of the learned models 
on predicting missing data. Eor each test image, the lower 
half is assumed missing and the upper half is used to infer¬ 
ence the hidden units (Gan et al., 2015b). Then, with the 
hidden units, the lower half is reconstructed. Prediction is 
done by repeating this procedure and finally sampling from 
the generative model with the inferred hidden units. Eig. 4 
demonstrates some example completions for the missing 
data on MNIST. 

4.2.2. Caltech 101 Silhouettes 

The Caltech 101 Silhouettes dataset consists of 4,100 train¬ 
ing samples, 2, 264 validation samples and 2,307 test sam¬ 
ples. We first compare our method to RWS on two bench¬ 
mark models in Table 3 (Top) and observe that our method 
achieves significant improvements. On SBN/SBN 200- 
200, we get a test Est. LL. of —108.0 which improves over 
RWS for 18 nats. 

Table 3 (Bottom) summarizes ours best results and other 
state-of-the-art results. Our NADE/NADE 150 network 


Table 4. Results of log-likelihood estimation on single-stochastic- 
layer variational auto-encoder. Results of VAE and IWAE are 
taken from Burda et al. (2015). 


Dataset 

VAE 

IWAE 

DSGNHT-NAIS 

binarized MNIST 

-88.83 

-87.63 

- 86.93 

Omniglot 

-107.62 

-106.12 

- 106.10 


reaches a test Est. LL. of —100.0, which improves RWS 
on the same model for 4.3 nats. We observe a remarkable 
effect of increasing the number of samples M for poste¬ 
rior mean: the test Est. LL. of —100.0 at M = 100 im¬ 
proves 5.3 nats compared to—105.3atM = l. Gan et al. 
(2015b) achieve —96.40 by training EVSBN (Erey, 1998) 
with both training data and validation data. A latest work 
by Goessling & Amit (2015) achieves —88.48 by devel¬ 
oping a mixture model of sparse autoregressive network. 
Eig. 3(b) visualizes the samples drawn from the learned 
models. 

4.3. Variational Auto-Encoders 

Einally, we consider the DGMs with continuous hidden 
variables. One popular example is the variational auto¬ 
encoder (VAE) (Kingma & Welling, 2014). Intuitively, 
the posterior of DGMs with continuous hidden variables is 
harder to capture, as the hidden variables have much more 
freedom compared to that of DGMs with discrete hidden 
variables. Such freedom potentially results in high vari¬ 
ance of the gradients estimation. VAE and the importance 
weighted auto-encoders (IWAE) (Burda et al., 2015) alle¬ 
viate this problem by adopting a reparametrization trick. 
Then the variational parameters (j) can be optimized tying 
to the generative model. 

In our DSGNHT-NAIS, we indeed observe high variance of 
the gradients estimation. However in theory, any distribu¬ 
tion that satisfies g(z|x;(/)) > 0 wherever p(z|x, 0) > 0 
can be used as a proposal. Such a property makes any 
other reasonable objectives (instead of the inclusive KL- 
divergence described in Sec. 3.3) for g'(z|x;(/)) is adopt- 
able. We adopt the same objective in IWAE for the proposal 
distribution ^ and find it works well in practice. 

We follow IWAE to train a single-stochastic-layer VAE 
with 50 hidden units. In between the data and the hidden 
variables are two deterministic layers with tanh activation. 
The model is trained on the binarized MNIST and the Om- 
niglot datasets. We use the Omniglot dataset downloaded 
from Burda et al. (2015) which consists of 23, 000 training 
samples, 1,345 validation samples and 87,00 test samples. 
We use M = 200 for the posterior mean estimator and fol¬ 
low IWAE to use iT = 5,000 to evaluate the test Est. LL.. 
Since IWAE also adopts an importance sampler to estimate 

^We use the objective in IWAE for optimizing the proposal 
distribution ^(z|x; (p) only, leaving other part of our method un¬ 
changed. 
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the objective (as well as its gradient), we compare the re¬ 
sults using S = b samples during training for both IWAE 
and our method. We achieve comparable or better results 
as summarized in Table 4. 

5. Conclusions and Future Work 

We propose a powerful Bayesian inference method based 
on stochastic gradient MCMC for deep generative models 
with continuous parameter space. It enjoys several advan¬ 
tages of Bayesian formalism such as sparse Bayesian infer¬ 
ence. Our results include state-of-the-art performance on 
standard published datasets. 

For future work we like to investigate the performance on 
learning sparse Bayesian models. Also, learning nonpara- 
metric Bayesian DGMs is another interesting challenge. 
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A. Model Setup 

We describe how the models are constructed with the con¬ 
ditional stochastic layers. Each model should consists of 
a top layer a data layer and (op¬ 

tionally) several intermediate layer p(z^^^|z^^+^\0). The 
output of the layer (/ + 1) (random samples) is passed as 
the input to the layer I and thus a full generative process for 
the data is built. In principle, the type of each layer can be 
chosen arbitrarily, as long as the input dimension and the 
output dimension of adjacent layers match to each other. 

The recognition model can be constructed in a similar way. 
Given a generative model with L hidden layers, the recog¬ 
nition model should also contains L stochastic layers. The 
first layer takes x as input and outputs random samples of 
The output of the layer I (random samples of z^^^) is 
passed as the input to the layer / -f-1. Note the type of each 
layer can also be chosen arbitrarily, as long as the dimen¬ 
sions of adjacent layers match to each other and the output 
dimension of the l-th layer matches the input dimension of 
the l-th layer of the generative model. 

In our experiments, all tested models and their recognition 
models consist only one type of stochastic layer. In the fol¬ 
lowing we describe the detailed architectures of our tested 
models. 

SBN/SBN models: 

For all models with SBN layers we construct the recogni¬ 
tion model with SBN layers too. We use four SBN/SBN 
architectures in the experiments: 1-hidden-layer with 200 
hidden units (SBN/SBN 200); 2-hidden-layer with 200 
hidden units in each hidden layer (SBN/SBN 200-200); 
3-hidden-layer with 200 hidden units in each hidden 
layer (SBN/SBN 200-200-200); and 4-hidden-layer with 
300(closest to data), 100, 50, 10 hidden units (SBN/SBN 
300-100-50-10). 

We use the following top layer (equivalent to factorized 
Bernoulli distribution): 

= 1|0) = a{bi), (21) 

and the likelihood model: 

p{zy = l|z('+i),0) = + by), (22) 

where we define z^^^ = x. The model parameters are 6 = 

DARN/NADE models: 

we follow Bornschein & Bengio (2015) to use NADE lay¬ 
ers in the recognition model for the DARN models. We test 
a shallow model (1-hidden-layer DARN/NADE 200) in our 
experiments. We use the following top layer (equivalent to 
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FVSBN (Frey, 1998)): 

+ b^^), (23) 

and the likelihood model: 


pizP = i\z%z^^+^\e) = 

ct(U«z('+i)+W« z«+ 6 f)). 


(24) 


The model parameters are 6 = U 

NADE/NADE models: 

We test a shallow model (1-hidden-layer NADE/NADE 
200) in our experiments. We use the following top layer: 


p( 2 :P = l|zlf^\ 6 >) = 




(25) 


reparametrization trick (Kingma & Welling, 2014): 

V r =v F i„„ - \ p(x,zW) 

=V<^Ee(fc)^p(e) 


=E 


e(^)~p(€) 


=E 


e(fc)r^p(e) 


log-Y^ 

K ^k=l g(z(^) |x) 

log 2 , X, </))) 

^4- log Efc=i > X, </))) 

y], ■WfeV0logu;(x,z(el''l,x,</))) 


where we have omitted the model parameters 6 in the 
above gradients, since 9 is fixed when learning the pro¬ 
posal distribution. In the above derivations, • • • , 
are the auxiliary variables as defined in VAE. Wk = 

M;(x,z(eW,x,(/))) = are the importance 

weight and Wi are the normalized importance weights as 
defined in IWAE. 


B. Experimental Setup 


and the likelihood model: 


= . 0 + 1 ) = 


p{zl ^ = 1 






(0 JI) 


<i^<i 


_,_U(0z('+i) +a('l) + R^';’zl'+il +6fl), (26) 


The model parameters are 6 = {IjO), r0)}^^^i u 

{V 0 ),w 0 ),a 0 ),b 0 )}f^o. 

VAE/VAE models: 

Eor the VAE model, we follow Kingma & Welling (2014) 
to use an isotropic multivariate Gaussian top layer: 

p{z^P = l\e) =J\f{ 0 ,l). (27) 


The VAE stochastic layer itself contains an internal MLR In 
our experiments, we train single-stochastic-layer VAE with 
50 hidden units. In between the data and the hidden vari¬ 
ables are two deterministic layers with tanh activation. The 
dimension of the two deterministic layers are both 100. The 
recognition model is a stochastic VAE layer within which 
are two 100-dimensional deterministic layers. Such an ar¬ 
chitecture is used in Burda et al. (2015). 

In experiments of training VAE, we adopt the objective in 
IWAE (Burda et al., 2015) for learning the proposal distri¬ 
bution g(z|x; 0): 




^g(z|x; 0 ) 


1 ^ 


p(x, Z*^^)) 


X ^ g(z(^)|x)_ 


(28) 


Then the gradient can be evaluated by adopting the 


We describe our experimental setup here, including the pa¬ 
rameter setting and implementation details. 

In our implementation, we use the reformulated form of 
multivariate SGNHT (Ding et al., 2014): 

0t+i = 0t + ut, (29) 

ut+i = ut-^tOut- r]VeU (^t+i) + A/'(0, 2ar]l), 
at+i = at + (pt+i © Pt+i - r]l), 

where we have setting u = Ap, 77 = A2, a = and 
a = AA. This reformulation is cleaner and easier to imple¬ 
ment. In analog to SGD with momentum, 77 is called the 
learning rate and 1 — a are the momentum terms (Chen 
et al., 2014). The initialization of SGNHT is as follows: 
u is random sampled from A/'(0, r]l) and a is initialized 
as al. There are three parameters for SGNHT: the learn¬ 
ing rate 77 , the momentum decay a, and the mini-batch size 
B. In our implementation we choose the mini-batch size 
B = 100, the momentum decay a = {0.1,0.01}. Eor 
numerical stability, we choose V = where 7 is called 
the “per-batch learning rate” (Chen et al., 2014). The per- 
batch learning rate 7 is chosen from {0.01, 0.005, 0.001} 
with best performance. 

Eor the recognition model, we use Adam (Kingma & Ba, 
2015) to learn the parameters 0. There are four parameters 
for Adam: the stepsize 77 ', the exponential decay rates {/3i, 
^ 2 } and e which is used to prevent division by zero. In 
our implementation we choose Pi = 0.9, ^^2 = 0.999 and 
e = 10“^^. The stepsize 77 ' is chosen from {1, 3, 5} x 10“^ 
with best performance. 

The model parameters are initialized following the heuris¬ 
tic of Glorot & Bengio (2010). The Student-t’s prior for all 
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Algorithm 2 A Detailed Version of Doubly Stochastic Gradient MCMC with Neural Adaptive Proposals 
Input: X = {xi, • • • , xat}: the dataset 
Input: S: number of samples used during training 

Input: M: number of samples used for computing the posterior mean estimation Eqn. (19) 

Input: 7 : per-batch learning rate for SGNHT, \B\ \ mini-batch size, a: momentum decay 
Input: r]'\ step size for Adam 

Input: UQ^ricj)'. number of 0 or 0 update during each mini-batch 
Initialize 6, (j)\ following the heuristic of Glorot & Bengio (2010) 

Initialize u ^ A/’( 0 , 77 !), a = al 
for epoch = 1 , 2 , • • • do 

Randomly split the data X into mini-batches 5i, • • • , 
for mini-batch Bi in {5i, • • • , Bj^i\b\ } do 

for = 1, • • • , do 

Sample {zn from the proposal g'(z|xn; 0), n G Bi 
Estimate the gradients V logp(x^|0) with Eqn. (14), n e B^ 

Update 6 with Eqn. (29) 

for 70 = 1, • • • ,^0 do 

Sample {zn from the proposal g(z|x^; 0), n e Bi 
Estimate the gradients V 0 ^^(0; 6, X) with Eqn. (18), n e Bi 
Update (f) using Adam optimizer with V 0 J^(0; X) 

end for 
end for 
end for 
end for 

Run another M epochs to estimate the posterior mean 6 

Output: 6 


model parameters are set with a scale parameter a = 0.09, 
location parameter /i = 0 and degrees of freedom u = 2.2. 

Our method involves updating the generative model param¬ 
eters 9 and the recognition model parameters 0 together 
(one step of 9 update and one step of 0 update within each 
mini-batch). One natural extension is to make the num¬ 
bers of the two type of updates adjustable. We thus set the 
parameter ne which controls the number of steps of 9 up¬ 
date within each mini-batch, and parameter n 0 which con¬ 
trols the number of steps of 0 update following each step 
of 9 update. Larger no can potentially make the samples 
of 9 less correlated. Larger can potentially make the 
proposal distribution more accurate. We set = 10 and 
770 = 1 in our implementation. 

Einally, in Alg. 2 we summarize a detailed version of our 
method. 

C. Derivations 

We provide the derivations of the Gibbs sampler for 
the DSGNHT-Gibbs. The hidden variables are sampled 
layer-wisely and dimension-wisely. We define z*^^^ = x 
and = 0 for convenience. Then the probability 

79 ( 2 :^^ |z^^], X, z^^^^) can be written as 79 ( 2 ;^^ |z^^], z*^^^^). 


We have the following Gibbs sampler: 

^, zl*], , z(>*)) 

=p(z«0|z(')).p(7)|z('+i)) 


xexp 


(w('7(*+i)+7))^(o_iog(i+g(w: 






(X exp 


i'=i 

- E 






A Gibbs sampler for DARN can be derived similarly. 












